Skip to content

Conversation

@nipunsadvilkar
Copy link
Owner

PySBD component using Language.factory

@codecov-commenter
Copy link

codecov-commenter commented Jun 29, 2022

Codecov Report

Merging #114 (e07808a) into master (5905f13) will decrease coverage by 0.08%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##           master     #114      +/-   ##
==========================================
- Coverage   98.43%   98.35%   -0.09%     
==========================================
  Files          38       39       +1     
  Lines        1150     1153       +3     
==========================================
+ Hits         1132     1134       +2     
- Misses         18       19       +1     
Flag Coverage Δ
unittests 98.35% <50.00%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pysbd/utils.py 73.33% <42.85%> (-2.53%) ⬇️
pysbd/about.py 100.00% <100.00%> (ø)
pysbd/__init__.py 100.00% <0.00%> (ø)

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

Comment on lines -105 to -107
},
entry_points={
"spacy_factories": ["pysbd = pysbd.utils:PySBDFactory"]
Copy link
Owner Author

@nipunsadvilkar nipunsadvilkar Jun 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rmitsch I would have to remove this entrypoint now as spacy uses @Language.factory decorator compulsorily in spacy v3 to register a custom component and since PySBDFactory resides at pysbd/utils.py, I would need to add spacy>=3 requirement to pysbd's setup.py

I wish to keep pysbd lightweight (use only inbuilt python modules).
Do you have any thoughts on this? Like other way around?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that's tricky. You could have a look at vendoring @Language.factory. You'd definitely need the registry functionality which can be found in https://github.com/explosion/catalogue now. It's still relatively lightweight, but it's already breaking your requirement of only having inbuilt Python modules.

How's spacy_factories used within PSBD?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not used in pysbd.

psybd python library is shipping psybd named spaCy component out-of-the-box via entrypoints.

Given a python environment with spacy and pysbd installed, nlp.add_pipe("pysbd") will work without importing pysbd explicitly.

More info here: https://spacy.io/usage/saving-loading#entry-points-components

Copy link

@rmitsch rmitsch Jun 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively: you could offer pybsd and pybsd[spacy], with only the latter supporting the usage as a spaCy v3.x component and installing spaCy by default.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, was thinking of doing this. Will look into it 👍🏼

print('sent_id', 'sentence', sep='\t|\t')
for sent_id, sent in enumerate(doc.sents, start=1):
print(sent_id, sent.text, sep='\t|\t')
print(sent_id, repr(sent.text), sep='\t|\t')
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity: why is the repr() necessary here?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessary at all. I add it sometimes just to see raw representation of the sentence. Especially in given example to see whether trailing spaces are captured properly or not 😅

Screenshot with repr vs without
image

Comment on lines -105 to -107
},
entry_points={
"spacy_factories": ["pysbd = pysbd.utils:PySBDFactory"]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that's tricky. You could have a look at vendoring @Language.factory. You'd definitely need the registry functionality which can be found in https://github.com/explosion/catalogue now. It's still relatively lightweight, but it's already breaking your requirement of only having inbuilt Python modules.

How's spacy_factories used within PSBD?

@davidberenstein1957
Copy link

Are you still working on this? Otherwise I could have a look.

@nipunsadvilkar
Copy link
Owner Author

Hey @davidberenstein1957, sure you can take a look at it.
But I'm not sure what would be best way since I want to keep pysbd lightweight and to support psybd with spacy v3 with Language.factory is needed and which would make me add spacy as dependency.

Let me know if you happen to work on the recommendations suggested by @rmitsch above.

@rbroderi
Copy link

rbroderi commented Mar 2, 2024

here would be an option to update the factory method and not require spacey as a hard requirement to pysbd.

from typing import Any
try:
    from spacy.language import Language
    langfac = Language.factory
except ImportError:
    def langfac(*args:Any,**kwargs:Any):
        def decorator(function:Any):
            def wrapper(*args:Any, **kwargs:Any):
                pass
            return wrapper
        return decorator
@langfac(name="pysbd",default_config={"language": 'en'})
class PySBDFactory(object):
    """pysbd as a spacy component through entrypoints"""

    def __init__(self, nlp, name,language='en'):
        self.nlp = nlp
        self.name = name
        self.seg = pysbd.Segmenter(language=language, clean=False,
                                   char_span=True)

    def __call__(self, doc):
        sents_char_spans = self.seg.segment(doc.text_with_ws)
        start_token_ids = [sent.start for sent in sents_char_spans]
        for token in doc:
            token.is_sent_start = (True if token.idx
                                   in start_token_ids else False)
        return doc

`

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants